Multi-SimLex: A Large-Scale Evaluation of Multilingual and Crosslingual Lexical Semantic Similarity
Abstract
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of the Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
Related resources
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with on...
Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages
The last two decades have seen the development of various semantic lexical resources such as WordNet (Miller, 1995) and the USAS semantic lexicon (Rayson et al., 2004), which have played an important role in the areas of natural language processing and corpus-based studies. Recently, increasing efforts have been devoted to extending the semantic frameworks of existing lexical knowledge resource...
Semantic access in number word translation: the role of crosslingual lexical similarity.
The revised hierarchical model of bilingualism (e.g., Kroll & Stewart, 1994) assumes that second language (L2) words primarily access semantics through their first language (L1) translation equivalents. Consequently, backward translation from L2 to L1 should not imply semantic access but occurs through lexical wordform associations. However, recent research with Dutch-French bilinguals showed t...
SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation
We present SimLex-999, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways. First, in contrast to gold standards such as WordSim-353 and MEN, it explicitly quantifies similarity rather than association or relatedness so that pairs of entities that are associated but not actually similar (Freud, psychology) have a l...
Harmonised large-scale syntactic/semantic lexicons: a European multilingual infrastructure
The paper aims at providing an overview of the situation of Language Resources (LR) in Europe, in particular as emerging from a few European projects regarding the construction of large-scale harmonised resources to be used for many application purposes, including multilingual ones. An important research aspect of the projects is given by the very fact that the large enterprise described is, at ...
Journal
Journal title: Computational Linguistics
Year: 2021
ISSN: 1530-9312, 0891-2017
DOI: https://doi.org/10.1162/coli_a_00391